Abstract: In distributed storage systems built using commodity
hardware, it is necessary to have data redundancy in
order to ensure system reliability. In such systems, it is also
often desirable to be able to quickly repair storage nodes that 
fail. We consider a scheme — introduced by El Rouayheb and 
Ramchandran — which uses combinatorial block design in order 
to design storage systems that enable efficient (and exact) node 
repair. In this work, we investigate systems where node sizes 
may be much larger than replication degrees, and explicitly 
provide algorithms for constructing these storage designs. Our 
designs, which are related to projective geometries, are based 
on the construction of bipartite cage graphs (with girth 6) 
and the concept of mutually-orthogonal Latin squares. Via 
these constructions, we can guarantee that the resulting designs 
require the fewest number of storage nodes for the given 
parameters, and can further show that these systems can be 
easily expanded without need for frequent reconfiguration. 

I. Introduction 

Recent trends in distributed storage systems have been to- 
ward the use of commodity hardware as storage nodes, where 
nodes may be individually unreliable. Such systems can still 
be feasible for large-scale storage as long as there is overall 
reliability of the entire storage system. Recent research in 
distributed storage systems has focused on using techniques 
from coding theory to increase storage efficiency, without 
sacrificing system reliability and node repairability [1].

In this work, we consider storage systems where failed 
storage nodes must be quickly replaced by replacement 
nodes. To achieve short downtimes, we consider techniques 
where the repair of a particular node (i.e., by obtaining 
replacement data) is via contacting multiple non-failed nodes 
in parallel — where each contacted node contributes only a 
small portion of the replacement data. Such replacement
strategies have been studied in the context of both functional
repair [1], where replacement nodes serve functionally for
overall data recovery, and exact repair, where replacement
nodes must be exact copies of the failed node.

We build upon the work of El Rouayheb and Ramchandran [2],
who propose a storage system allowing for
exact repair. Using the idea of Steiner systems [3], the
authors design distributed storage systems with the desired
redundancy and repairability properties: even though
each storage node is responsible for storing multiple data
chunks, replacement of any failed node is always possible
by obtaining only a single data chunk from each of several
non-failed nodes. In systems where multiple nodes can be
read in parallel, such a scheme ensures high availability,
even in the presence of node failures. Moreover, since the
scheme described in [2] stores data in an uncoded manner,
for computing applications the storage nodes may also serve 
as processing nodes. 

A Steiner system $S(t, k, v)$ specifies a distribution of $v$
elements into blocks of size $k$ such that the maximum number
of overlapping elements between any two blocks is $t - 1$
(so if $t = 2$, then no two blocks can share any pair of
elements).¹ For instance, Example 1 shows a Steiner system
and the resulting distribution of data chunks to storage nodes.

Example 1. Consider a distributed storage system to store 
9 total data chunks, where each chunk is stored within storage 
nodes that can hold 3 chunks each. Then it is possible to 
distribute the chunks across 12 nodes, where every chunk 
has exactly 4 replicas and any two distinct nodes share at
most one chunk. This is shown in Figure 1.

Fig. 1. Storage design from Steiner system $S(2, 3, 9)$; same as [2, Fig. 6(a)].
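
The parameters of Example 1 can be checked with a short sketch. It builds one $S(2, 3, 9)$ from the 12 lines of the affine plane over the integers mod 3; this particular construction and the chunk labels 0..8 are assumptions made for illustration, so the layout need not match the labeling of Fig. 1.

```python
from itertools import combinations

def steiner_2_3_9():
    """Blocks of one S(2, 3, 9): the 12 lines of the affine plane over Z_3."""
    def point(x, y):
        return 3 * x + y                      # label chunk (x, y) as 0..8
    blocks = []
    for a in range(3):                        # sloped lines y = a*x + b
        for b in range(3):
            blocks.append(frozenset(point(x, (a * x + b) % 3) for x in range(3)))
    for c in range(3):                        # vertical lines x = c
        blocks.append(frozenset(point(c, y) for y in range(3)))
    return blocks

nodes = steiner_2_3_9()                       # each block is one storage node
replicas = [sum(chunk in node for node in nodes) for chunk in range(9)]
max_overlap = max(len(a & b) for a, b in combinations(nodes, 2))
print(len(nodes), replicas, max_overlap)
```

Each of the 12 nodes holds 3 chunks, every chunk has 4 replicas, and no two nodes share more than one chunk.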

In most practical distributed storage systems, however, it is
often desirable for the number of data chunks per node² to be
much greater than the replication degree of each chunk. For
example, the Google File System [4], which stores data in
chunks as small as 64 MB each, has a replication degree
on the order of three replicas but may store thousands of 
chunks on each storage node. Thus in this work, we consider 
a graph-based construction of Steiner systems where the 
replication degree and node size are significantly asymmetric. 

Specifically, we construct storage systems where the replication
degree of each data chunk is $q + 1$, whereas each node
may store up to $q^n + q^{n-1} + \cdots + q^2 + q + 1$ chunks (for any
given integer $n$). Although it is known from the theory of
projective geometries [3] that systems with these parameters
can be designed, by using our graph-based method we are 
able to give a systematic construction that is highly scalable; 

¹In the rest of this paper, whenever we use the term Steiner system, we
are referring to Steiner systems with $t = 2$.

²For brevity, we refer to the number of chunks per node as the node size.



for a system constructed according to the methods in this 
paper, it is always possible to increase the storage system 
without moving any existing data chunks — and still be able 
to preserve the property that no pairs of chunks recur in more 
than one storage node. 

Our construction is based on relating Steiner system prob- 
lems with the problem of shortest cycles on bipartite graphs. 
More specifically, our systems arise from the construction of 
cage graphs [5], which are graphs with the minimum number
of vertices for a given allowable shortest cycle length and 
other specified conditions on the vertex degrees. Because we 
are constructing cage graphs, we further know that for a given 
desired node size and replication degree, our constructions 
are the smallest possible systems (in terms of total number of 
storage nodes and total number of data chunks stored). This 
is useful for the practical application of such constructions, 
as it immediately translates into least hardware cost for the 
desired system requirements. 

A. Related Work 

The problem of distributed storage with efficient repair is
discussed in [1]. Using network coding, the authors propose
a scheme for storing data where node repair is functional.
Dimakis et al. [1] also define the idea of a storage-bandwidth
tradeoff, and discuss ways to implement either minimum
storage or minimum bandwidth systems. Even though exact
repair of storage nodes is sometimes necessary, the storage-bandwidth
tradeoff under exact repair is not yet fully understood.
Building upon the network coding constructions of [1],
Rashmi et al. [6] give a scheme for achieving the minimum 
bandwidth operating point under exact repair, finding a point 
on the storage-bandwidth tradeoff curve. 

El Rouayheb and Ramchandran [2] introduce a related
scheme, termed fractional repetition codes, which can perform
exact repair for the minimum bandwidth regime. They
then derive information theoretic bounds on the storage
capacity of such systems with the given repair requirements.
Although their repair model is table-based (instead of random
access as in [1]), the scheme of [2] has the favorable
characteristics of exact repair and the uncoded storage of
data chunks. Randomized constructions of such schemes are
investigated in [7].

Uncoded storage has numerous advantages for distributed 
storage systems. For instance, uncoded data at nodes allows 
for distributed computing (e.g., for cloud computing), by 
spreading out computation to the node(s) that contain the 
data to be processed. Upfal and Wigderson [8] consider a
method for parallel computation by randomly distributing 
data chunks among multiple memory devices, and derive 
some asymptotic performance results. In contrast, our designs 
are deterministic, and we are also able to guarantee the 
smallest possible size for our storage system. Furthermore, if 
uncoded data chunks are distributed among the nodes accord- 
ing to Steiner systems, then load-balancing of computations 
is always possible. 

Steiner systems are an example of balanced incomplete
block designs (BIBDs), within the field of combinatorial design
theory [9]. Some parameters for which Steiner systems can
be designed are given in [10], [2]. In this work, we consider
Steiner systems similar to those from finite projective planes.
Specifically, designs in which the replication degree is $q + 1$
and with each storage node storing up to $q^n + q^{n-1} + \cdots + q^2 + q + 1$
data chunks can also be found from the projective
geometry $PG(n + 1, q)$, where the data chunks are the lines
and the storage nodes are the points of the corresponding
space. However, in this work we show that via our recursive 
graph construction method, it is possible to initially deploy 
small storage systems without needing to know a priori the 
future maximum extent of the storage system — while still 
being able to preserve the Steiner property in subsequent 
expanded systems.³ This alternate approach for constructing
projective geometries has tremendous benefits for practical 
storage system designs, as otherwise the connection between 
system design and the construction and extension of such 
geometries is not immediately obvious. Furthermore, our 
graph-based construction is simple to implement, and designs 
are uniquely determined given knowledge of the base set of 
mutually-orthogonal Latin squares (which we discuss later). 

In addition to [2], the use of BIBDs for guaranteeing load-balanced
disk repair in distributed storage systems is also
considered in [11], [12], for application to RAID-based disk
arrays. In [12], the authors discuss how block designs may be
used to lay out parity stripes in declustered parity RAID disk 
arrays. The block designs from our work may be helpful for 
distributing parity blocks in this scenario, in order to build 
disk arrays with good repair properties. 

Certain block designs may also be applicable to the design 
of error-correcting codes, particularly in the construction 
of geometrical codes [13, Sections 2.5 and 13.8]. Graphs 
without short cycles have been considered in the context of 
Tanner graphs [14], and finite geometries in particular have
been considered in the context of LDPC codes [15]. Block
designs and their related bipartite graphs are also considered
in code design for magnetic recording applications in [16].

B. Outline of Paper 

In the next section, we provide necessary background. 
Section III illustrates how our constructions work, through
the construction of regular bipartite cage graphs; this construction
provides a base upon which the larger construction
of Section IV is built. In Section IV, we give the main
contribution, which is the design of scalable storage systems
that can be expanded readily. Finally, Section V concludes.

II. Preliminaries 

A. Notation 

When describing parameters for constructible graphs we
let $p_n(q) = q^n + q^{n-1} + \cdots + q^2 + q + 1 = \frac{q^{n+1} - 1}{q - 1}$ for
$n \in \mathbb{N}$. In the rest of this paper, $q$ will always denote
either a prime number or a power of a prime.
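
As a quick numerical illustration of this notation, the sketch below evaluates $p_n(q)$ both as a sum of powers and through the geometric-series closed form. The function names `p` and `p_closed` are hypothetical helpers introduced only for this example.

```python
def p(n, q):
    """p_n(q) = q^n + q^(n-1) + ... + q + 1, as a sum of powers."""
    return sum(q ** e for e in range(n + 1))

def p_closed(n, q):
    """Geometric-series closed form (q^(n+1) - 1) / (q - 1), for q >= 2."""
    return (q ** (n + 1) - 1) // (q - 1)

# The two forms agree for small prime powers q and small n.
for q in (2, 3, 4, 5):
    for n in range(1, 5):
        assert p(n, q) == p_closed(n, q)

print(p(1, 2), p(2, 2), p(3, 2))  # 3 7 15
```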

³We do not describe this in detail, but very similar graph-based methods
can also be used to construct designs related to the affine geometry $AG(n + 1, q)$.
These constructions are just as expandable as the projective geometry-based
designs. A brief note on these constructions is given in Section IV-C.



In this work, we consider simple undirected bipartite
graphs $G = (X, Y, E)$. Cardinality is denoted by $|\cdot|$. For
a vertex $x$, $\deg(x)$ gives the number of incident edges. We
only consider graphs where all of the vertices in a vertex set
have the same degree, so we can write $\deg(X) = \deg(x)$ for
some $x \in X$. The symbol $\sim$ is used to denote an edge; for
vertices $x$ and $y$, we say that $x \sim y$ if and only if $(x, y) \in E$.

B. Graph Interpretation of Steiner Systems 

A Steiner system is an arrangement of a collection of elements, $V$, into
blocks, $B$, where any $t$-subset of elements (here, any pair) occurs only once
in the block collection. We reinterpret the Steiner system
requirements by considering its incidence graph [17]. In this
work, we will consider bipartite graphs $G = (X, Y, E)$,
where there are $u$ vertices in $X$, each of degree $k$, and
$v$ vertices in $Y$, each of degree $l$. We call such a graph
biregular when $k \ne l$. Clearly, $lv = uk$.

Now we label the vertices of $Y$ as the elements of $V$
(i.e., $V = \{y_g \mid g = 0, 1, \ldots, v - 1\}$), and the vertices
of $X$ as the blocks of $B$. Consider a particular vertex $x_h$
(where $h \in \{0, 1, \ldots, u - 1\}$), and define block $b_h =
\{y_g \in Y \mid y_g \sim x_h\}$. Then the collection of blocks,
$B = \{b_h \mid h = 0, 1, \ldots, u - 1\}$, satisfies the following:

1) Each element $y_g \in V$ occurs in exactly $l$ blocks of $B$.

2) Each block $b_h \in B$ contains $k$ elements.

It is clear that whenever two blocks $b_h$ and $b_{h'}$ share some
pair of elements $y_g$ and $y_{g'}$, then this is equivalent to the
4-cycle $x_h \sim y_g \sim x_{h'} \sim y_{g'} \sim x_h$. Thus the nonexistence
of such 4-cycles is equivalent to the nonexistence of shared
pairs of elements between blocks. In Figure 2, we show the
bipartite graph associated with Example 1.

Fig. 2. Bipartite graph of Steiner system corresponding to $k = 3$, $l = 4$.
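
The equivalence between shared pairs and 4-cycles suggests a simple test, sketched here under the assumption that a design is given as a list of blocks over integer element labels. The helper names `incidence_graph` and `has_four_cycle` are hypothetical.

```python
from itertools import combinations

def incidence_graph(blocks):
    """Edges (h, g) of the bipartite graph: block x_h is adjacent to element y_g."""
    return {(h, g) for h, block in enumerate(blocks) for g in block}

def has_four_cycle(blocks):
    """True iff two blocks share a pair of elements, i.e. the graph has a 4-cycle."""
    seen_pairs = set()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            if pair in seen_pairs:
                return True
            seen_pairs.add(pair)
    return False

good = [{0, 1, 2}, {0, 3, 4}, {1, 3, 5}]   # no pair of elements repeats
bad = [{0, 1, 2}, {0, 1, 3}]               # the pair (0, 1) appears twice
print(has_four_cycle(good), has_four_cycle(bad))  # False True
```

Recording every pair once keeps the check linear in the total number of pairs, rather than comparing all pairs of blocks.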

In the above, we construct Steiner systems where $l$ is the
repetition degree, $k$ is the block size, $v$ is the total number
of elements, and $u$ is the total number of blocks. In the rest
of this paper, we shall always assume that $l \ge k$.⁴

Since $X$ and $Y$ are interchangeable, we could instead let
$X$ be the elements and $Y$ be the blocks of another block
system,⁵ resulting in the transpose codes of [2]. Since
for practical cases we wish to construct distributed storage
systems where the repetition degree is smaller than the block
size, we will more often employ the transpose code. To stay
consistent with $l \ge k$, in these cases we let $k$ be the repetition
degree and $l$ be the block size. Under this interpretation, $u$ is
the number of elements and $v$ is the total number of blocks.

C. Cage Graphs 

In an undirected graph $G = (V, E)$, a cycle of length $d$ is
a set of $d$ vertices connected in a closed path. In the bipartite
graph $G = (X, Y, E)$, such a cycle must necessarily alternate
between vertices of $X$ and vertices of $Y$; thus any cycle
must have even length. The girth of a graph is defined as the
length of the shortest cycle in the graph.

⁴We can construct systems where $k > l$ by swapping the two vertex sets.

⁵In the language of finite geometries, interchanging the roles of elements
and blocks is the same as interchanging points and lines.

Then, a $d$-cage is a girth-$d$ graph with minimum number of
vertices for a particular desired degree distribution [5], [17].
The goal of this work is to construct biregular cages of girth 6
(so that no 4-cycles are present), in order to construct the
smallest possible Steiner system with the desired parameters.
Using transpose codes we can then construct systems requiring
the fewest possible storage nodes (i.e., smallest $v$) and
the least number of total distinct chunks (i.e., smallest $u$),
while still having the desired repetition degree $k$ and block
size $l$. Such systems will meet the lower bound of Lemma 1.⁶

Lemma 1. Consider a simple biregular bipartite graph
$(X, Y, E)$ that does not have any cycle of 4 or fewer vertices.
If $\deg(X) = k$ and $\deg(Y) = l$ (where $l \ge k$), then the
numbers of vertices, $v = |Y|$ and $u = |X|$, have lower bounds

$$v \ge 1 + l(k - 1) \qquad (1)$$
$$u \ge l + l(l - 1)(k - 1)/k. \qquad (2)$$

Proof: We sketch the proof for (1) here; a more detailed
proof of (1) as well as for (2) is provided in Appendix B-A.

One method for constructing the bipartite graph is by
starting with a single vertex $y \in Y$ (called the layer 0
vertex) and connecting it to $l$ vertices of $X$ (called the layer 1
vertices). Each of these vertices of $X$ must be connected to $k - 1$
other vertices of $Y$ (the layer 2 vertices), all of which are distinct
because there are no 4-cycles; this accounts for the $1 + l(k - 1)$
vertices of $Y$ in (1). Note that
any remaining vertices of $X$ (the layer 3 vertices) would then
need to be connected back to the layer 2 vertices of $Y$ in such
a way as to preserve the nonexistence of 4-cycles. ■
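
The bounds of Lemma 1 are easy to evaluate. The sketch below uses a hypothetical helper `lemma1_bounds` and rounds bound (2) up, since $u$ must be an integer; the three parameter pairs are the ones that appear in this paper's examples.

```python
from fractions import Fraction
from math import ceil

def lemma1_bounds(k, l):
    """Smallest v = |Y| and u = |X| allowed by bounds (1) and (2) of Lemma 1."""
    v_min = 1 + l * (k - 1)
    u_min = l + Fraction(l * (l - 1) * (k - 1), k)   # exact rational arithmetic
    return v_min, ceil(u_min)

print(lemma1_bounds(3, 4))   # (9, 12): Example 1, 9 chunks on 12 nodes
print(lemma1_bounds(4, 4))   # (13, 13): the regular case with q = 3
print(lemma1_bounds(3, 7))   # (15, 35): the k = 3, l = 7 design
```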

Any bipartite cage achieving the lower bounds of Lemma 1
satisfies the Steiner system property that each pair of elements
occurs in exactly one block. We already know that
every pair of elements occurs in at most one block. Since
$v = 1 + l(k - 1)$ and $u = l + l(l - 1)(k - 1)/k$ also satisfy
$\binom{v}{2} = u \binom{k}{2}$,⁷ we know that every pair of elements occurs in
at least one block, and therefore occurs in only one block.

The proof of Lemma 1 gives us clues on how to construct
bipartite graphs that achieve the lower bounds, which must
necessarily be cage graphs. We will show how to avoid
introducing 4-cycles between the layer 2 and layer 3 vertices,
by considering the use of mutually-orthogonal Latin
squares (see Appendix A or [20]). Specifically, in order
to construct the cage graphs, we will require the existence
of a set of $q$ mutually-orthogonal $q \times q$ squares,
$\{L^{(0)}, L^{(1)}, \ldots, L^{(q-1)}\}$, where $L^{(0)}$ is a square with every
column in natural order, and $L^{(1)}, L^{(2)}, \ldots, L^{(q-1)}$ are
mutually-orthogonal Latin squares where each square has its
zeroth column in natural order. Such a set always exists when
$q$ is a prime or a prime power. We give an example for $q = 3$:

⁶The result of Lemma 1 is sometimes known as a Moore-type bound [18],
although we note that the bound in (2) is tighter than the corresponding
bound in [19] for our case, when $l > k$.

⁷The condition $\binom{v}{2} = u \binom{k}{2}$ comes from the fact that there are a total of
$\binom{v}{2}$ pairs of elements, which should correspond exactly to the sets of $\binom{k}{2}$
pairs of elements in each of the $u$ blocks.



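Example 2. A set of 3 mutually-orthogonal $3 \times 3$ squares is

$$L^{(0)} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 1 & 1 \\ 2 & 2 & 2 \end{pmatrix}, \quad
L^{(1)} = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 2 & 0 \\ 2 & 0 & 1 \end{pmatrix}, \quad
L^{(2)} = \begin{pmatrix} 0 & 2 & 1 \\ 1 & 0 & 2 \\ 2 & 1 & 0 \end{pmatrix}.$$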
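
A set of squares with the properties described above can be generated for a prime $q$ as in the following sketch. The formula `(r + a*c) % q` is an assumption chosen for illustration (prime powers would need finite-field arithmetic instead of integers mod $q$), and the function names are hypothetical.

```python
from itertools import combinations

def orthogonal_squares(q):
    """squares[a][r][c] = (r + a*c) mod q, for a prime q (assumed formula)."""
    return [[[(r + a * c) % q for c in range(q)] for r in range(q)]
            for a in range(q)]

def is_latin(sq):
    """Every row and every column contains each symbol 0..q-1 exactly once."""
    q = len(sq)
    symbols = set(range(q))
    rows_ok = all(set(row) == symbols for row in sq)
    cols_ok = all({sq[r][c] for r in range(q)} == symbols for c in range(q))
    return rows_ok and cols_ok

def are_orthogonal(a, b):
    """Superimposing a and b yields every ordered pair of symbols exactly once."""
    q = len(a)
    return len({(a[r][c], b[r][c]) for r in range(q) for c in range(q)}) == q * q

squares = orthogonal_squares(3)
for sq in squares:
    print(sq)
```

Here `squares[0]` has every column in natural order (and is therefore not Latin), while the remaining squares are Latin with zeroth column in natural order; all pairs are orthogonal.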
III. Regular Cage Graphs

We now show how to construct girth-6 bipartite cage
graphs where the degrees of both vertex sets are equal.
More specifically, the vertex degrees will satisfy $\deg(X) =
\deg(Y) = q + 1$ (i.e., $k = l = q + 1$), where $q$ is any
prime or power of a prime. The resulting graphs will have
$|X| = |Y| = q^2 + q + 1$.

A. Construction of Regular Cage Graph 

The construction of regular bipartite cage graphs of girth 6
is inspired by the construction in Wong [5], and is given
in Algorithm 1. Bipartiteness arises from the construction.

Algorithm 1 Construction of bipartite cage when $k = l = q + 1$
1: [Layer 0] Start with a single vertex $y_0 \in Y$.
2: [Layer 1] Connect $y_0$ to $l = q + 1$ vertices of $X$. Without loss of
   generality, call these vertices $x_0, x_1, \ldots, x_{l-1}$.
3: [Layer 2] For each vertex $x_j$, $j = 0, 1, \ldots, l - 1$, connect $x_j$ to
   $k - 1 = q$ vertices of $Y$. Let $y_{j,m}$, $m = 0, 1, \ldots, k - 2$, denote the
   vertices of this step that are connected to vertex $x_j$.
4: [Layer 3] Connect each vertex $y_{0,m}$ ($m = 0, 1, \ldots, k - 2$) to $l - 1 = q$
   distinct vertices of $X$, called $x_{m,i}$, $i = 0, 1, \ldots, l - 2$. Therefore,
   $x_{m,i} \ne x_{m',i'}$ unless $m = m'$ and $i = i'$. There will be $(k - 1)(l - 1) = q^2$
   such vertices $x_{m,i}$.
5: Consider a vertex $x_{m,i}$, where $m \in \{0, 1, \ldots, k - 2\}$, $i \in
   \{0, 1, \ldots, l - 2\}$. Connect $x_{m,i}$ to vertices $y_{j+1,\,L^{(m)}_{i,j}}$, where $j =
   0, 1, \ldots, k - 2$.


The $q^2 + q$ layer 2 vertices $y_{j,m}$, $j = 0, 1, \ldots, q$
and $m = 0, 1, \ldots, q - 1$, coincide with the vertices
$y_1, y_2, \ldots, y_{q^2+q}$, and can be mapped using $y_{jq+m+1} =
y_{j,m}$. Similarly, the $q^2$ layer 3 vertices $x_{m,i}$, $m =
0, 1, \ldots, q - 1$ and $i = 0, 1, \ldots, q - 1$, coincide with the
vertices $x_{q+1}, x_{q+2}, \ldots, x_{q+q^2}$, and can be mapped using
$x_{q+mq+i+1} = x_{m,i}$.

Notice that the resulting graph consists of the layer 0 and
layer 2 vertices on one side of the graph, connected only to 
layer 1 and layer 3 vertices on the other side. 
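
The layered construction can be sketched in a few lines for a prime $q$. The code below is an illustration under stated assumptions, not the authors' implementation: the squares are generated by the assumed formula `(r + a*c) % q`, the layer 3 vertex $x_{m,i}$ is joined to group $j + 1$ through entry `sq[m][i][j]`, and elements are numbered $y_0 = 0$, $y_{j,m} = jq + m + 1$. The resulting labels need not match those used in the figures.

```python
from itertools import combinations

def regular_cage_blocks(q):
    """Blocks (X vertices, as sets of Y labels) of the k = l = q + 1 design."""
    sq = [[[(r + a * c) % q for c in range(q)] for r in range(q)]
          for a in range(q)]

    def y(j, m):
        return j * q + m + 1                  # y_{j,m}; y_0 is element 0

    blocks = []
    # Layers 0-2: x_j is adjacent to y_0 and to y_{j,0}, ..., y_{j,q-1}.
    for j in range(q + 1):
        blocks.append(frozenset([0] + [y(j, m) for m in range(q)]))
    # Layer 3: x_{m,i} is adjacent to y_{0,m} and to one vertex in each group j+1.
    for m in range(q):
        for i in range(q):
            blocks.append(frozenset(
                [y(0, m)] + [y(j + 1, sq[m][i][j]) for j in range(q)]))
    return blocks

for q in (2, 3, 5):
    blocks = regular_cage_blocks(q)
    sizes_ok = all(len(b) == q + 1 for b in blocks)
    no_4_cycle = all(len(a & b) <= 1 for a, b in combinations(blocks, 2))
    print(q, len(blocks), sizes_ok, no_4_cycle)
```

For each tested $q$ the sketch yields $q^2 + q + 1$ blocks of size $q + 1$, and no two blocks share more than one element, which is the girth-6 condition.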

We first show an example of Algorithm 1 with $k = 4$ and
$l = 4$ (so $q = 3$), before proving that this indeed results in the
desired cage graph. This graph will have $|X| = |Y| = 13$.

The first three steps are straightforward, as they involve
connecting the vertices of layers 0, 1, and 2 in a tree. Step 4
connects each of the vertices $y_{0,m}$ with the $l - 1 = q$
vertices $x_{m,i}$, $i = 0, 1, \ldots, q - 1$. This gives Figure 3a.



Now we consider connecting the other outgoing edges of
each $x_{m,i}$ vertex to the remaining $y_{j,\mu}$ vertices, $j \ne 0$. The
set of mutually-orthogonal squares of order $q = 3$, given in
Example 2, guarantees that 4-cycles do not get introduced in
step 5. Figure 3b shows the resulting bipartite cage graph.

B. Properties of Graph Constructed from Algorithm 1

We show that the graph constructed from Algorithm 1 is
indeed a cage graph, as well as discuss additional properties. 

Lemma 2. In the bipartite graph constructed from Algorithm 1,
the shortest cycle consists of at least 6 vertices.
Proof: See Appendix B-B or [5, Section 4]. ■




(a) Conclusion of step 4. (b) Final bipartite graph.

Fig. 3. Construction of bipartite cage where $k = l = 4$, using Algorithm 1.

Theorem 3 (see also [5, Section 4]). The regular bipartite
graph constructed from Algorithm 1 is a bipartite cage graph
of girth at least 6, with degree $q + 1$ at all vertices.

Proof: Algorithm 1 results in $1 + l(k - 1) = q^2 + q + 1$
vertices for $Y$ and $l + l(l - 1)(k - 1)/k = q^2 + q + 1$ vertices
for $X$, where every vertex has degree $q + 1$. Thus $v = |Y|$
and $u = |X|$ achieve the lower bounds of Lemma 1 for the
required degree distributions. By Lemma 2, the shortest cycle
has at least 6 vertices, so the result is shown. ■

By interpreting $Y$ as the elements and $X$ as the blocks, we
have constructed an $S(2, k, v) = S(2, q + 1, q^2 + q + 1)$ Steiner
system, and also a corresponding storage system design.

We see that in order to generate the cage graph and 
associated block system, the only required information is the 
generator element used to generate the multiplicative group 
for the finite field — as the set of mutually-orthogonal squares 
can then be uniquely determined. Thus lookup tables for the 
entire block design need not be stored, since the tables can 
always be generated easily. 

In fact, the constructibility of a regular cage graph with
$q^2 + q + 1$ vertices in each vertex set is equivalent to the
constructibility of a projective plane of order $q$ [17]. The
regular cage graph with $k = l = 3$ is the Heawood graph;
see Figure 4, which also shows the associated Steiner system.
This construction of the Heawood graph is analogous to the
Skolem construction [9] of Steiner triple systems for $v = 9$.
[Figure 4: (a) Bipartite graph. (b) Block design, including the blocks 1 3 5, 1 4 6, 2 3 4, and 2 5 6.]

Fig. 4. Steiner system corresponding to $k = 3$, $l = 3$. Figure 4a visualizes
the system as a bipartite graph, and Figure 4b shows the block design. This
gives the same Steiner system as in [2, Fig. 3 (Example 2)].

We also mention that similar methods can be used to
construct regular graphs (i.e., $k = l = q + 1$) of girth 6
when $q$ is not a prime power (e.g., see [21], where $q = 6$).

IV. Scalable Designs 



In Section IV-A, we construct cage graphs where the vertex
degrees of the two vertex sets are highly unbalanced, i.e.,
where $\deg(X) = k = q + 1$ but $\deg(Y) = l = p_n(q)$. We
discuss some favorable scalability properties in Section IV-B.



A. Construction of Designs with $k = q + 1$, $l = p_n(q)$

The construction here is recursive; thus we write $l[n] =
p_n(q)$ for the degree of the vertices in $Y$ at iteration $n$.⁸ (For
notational simplicity, if no iteration number is specified, then
it is assumed that we are referring to the quantity for the $n$-th
iteration.) The constructed cages have $|X| = \frac{p_{n+1}(q)\,p_n(q)}{q + 1}$
and $|Y| = p_{n+1}(q)$. We show such a graph in Figure 5.




Fig. 5. Construction of bipartite cage graph with $k = q + 1$, $l = p_n(q)$.

We will inductively construct bipartite cages with $k = q + 1$
and $l[n] = p_n(q)$ using a layered method similar to before.
Notice that for $n = 1$, the graph is the $k = l = q + 1$ cage.

Thus suppose that a cage graph with parameters $k = q + 1$
and $l[n - 1] = p_{n-1}(q)$ exists. For this graph, $v[n - 1] = p_n(q)$
and $u[n - 1] = \frac{p_n(q)\,p_{n-1}(q)}{q + 1}$. By taking $Y$ as the elements
and $X$ as the blocks, this gives a Steiner system with block
size $k = q + 1$ and with $v[n - 1] = p_n(q)$ total elements, i.e.,
$S(2, q + 1, p_n(q))$. (Here, each element is repeated $l[n - 1] =
p_{n-1}(q)$ times, and there are $u[n - 1] = \frac{p_n(q)\,p_{n-1}(q)}{q + 1}$ blocks.)
This system can then be used to construct the $k = q + 1$,
$l = p_n(q)$ cage, as given in Algorithm 2.
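
The parameters of this family can be tabulated and cross-checked against the counting identities used earlier. The sketch below uses hypothetical helper names and simply asserts that $lv = uk$, that both bounds of Lemma 1 hold with equality, and that the pair-counting condition $\binom{v}{2} = u\binom{k}{2}$ is satisfied.

```python
from math import comb

def p(n, q):
    """p_n(q) = q^n + ... + q + 1."""
    return sum(q ** e for e in range(n + 1))

def design_parameters(q, n):
    """(k, l, v, u) at iteration n: degrees k, l and sizes v = |Y|, u = |X|."""
    k, l = q + 1, p(n, q)
    v = p(n + 1, q)
    u, remainder = divmod(p(n + 1, q) * p(n, q), q + 1)
    assert remainder == 0                     # u is always an integer
    return k, l, v, u

for q, n in [(2, 1), (2, 2), (3, 2), (4, 3)]:
    k, l, v, u = design_parameters(q, n)
    assert l * v == u * k                                # edge count from both sides
    assert v == 1 + l * (k - 1)                          # bound (1) with equality
    assert u * k == l * k + l * (l - 1) * (k - 1)        # bound (2) with equality
    assert comb(v, 2) == u * comb(k, 2)                  # every pair in one block

print(design_parameters(2, 2))  # (3, 7, 15, 35)
```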

Algorithm 2 Construction of bipartite cage when $k = q + 1$, $l = p_n(q)$

Require: A set of $v[n - 1] = p_n(q)$ elements, and a collection $B =
   \{b_h \mid h = 0, 1, \ldots, u[n - 1] - 1\}$ of $(q + 1)$-element blocks $b_h$,
   such that each element has exactly $l[n - 1] = p_{n-1}(q)$ replicas and no
   particular pair of elements occurs in more than one block.
1: [Layer 0] Start with a single vertex $y_0 \in Y$.
2: [Layer 1] Connect $y_0$ to $l = p_n(q)$ vertices of $X$. Without loss of
   generality, call these vertices $x_0, x_1, \ldots, x_{p_n(q)-1}$.
3: [Layer 2] For each vertex $x_j$, $j = 0, 1, \ldots, l - 1$, connect $x_j$ to
   $k - 1 = q$ vertices of $Y$. Let $y_{j,m}$, $m = 0, 1, \ldots, k - 2$, denote the
   vertices of this step that are connected to vertex $x_j$.
4: for $h = 0$ to $u[n - 1] - 1$ do
5:    Let the block $b_h$ consist of elements $b_h = \{g_0, g_1, g_2, \ldots, g_q\}$.
6:    [Layer 3] Connect each vertex $y_{g_0,m}$ ($m = 0, 1, \ldots, k - 2$) to $q$
      distinct vertices of $X$, called $x^{(h)}_{m,i}$, $i = 0, 1, \ldots, q - 1$. Therefore,
      $x^{(h)}_{m,i} \ne x^{(h)}_{m',i'}$ unless $m = m'$ and $i = i'$. (For the $h$-th iteration,
      there will be a total of $(k - 1)q = q^2$ such vertices $x^{(h)}_{m,i}$.)
7:    Consider a vertex $x^{(h)}_{m,i}$, where $m \in \{0, 1, \ldots, k - 2\}$, $i \in
      \{0, 1, \ldots, q - 1\}$. Connect $x^{(h)}_{m,i}$ to $y_{g_{j+1},\,L^{(m)}_{i,j}}$, where $j =
      0, 1, \ldots, q - 1$.
8: end for

Ensure: Bipartite cage with degrees $k = q + 1$, $l[n] = p_n(q)$, and number
   of vertices $|Y| = v[n] = p_{n+1}(q)$, $|X| = u[n] = \frac{p_{n+1}(q)\,p_n(q)}{q + 1}$.



Algorithm 2 differs from Algorithm 1 in steps 6 and 7
because we only connect via $y_{g_j,m}$ (where $j = 0, 1, \ldots, q$)
instead of $y_{j,m}$ for all $j = 0, 1, \ldots, l - 1$. This is due to
only considering $(q + 1)$-element subsets instead of the entire
set of $x_0, \ldots, x_{l[n]-1}$ vertices when constructing each smaller
subcage.

⁸We let $u[n]$, $v[n]$ denote the respective quantities at iteration $n$. Since
$k[n] = q + 1$ for all $n$, we do not qualify $k$ with the iteration number $n$.



For each iteration $h$ where we select the subset of layer 1
vertices denoted by $b_h = \{g_0, g_1, g_2, \ldots, g_q\}$, let us call the
$b_h$-subgraph the subgraph induced by the subset of vertices

$$\{y_0\} \cup \{x_j \mid j \in b_h\} \cup \{y_{j,m} \mid j \in b_h,\ m = 0, 1, \ldots, k - 2\}
\cup \{x^{(h)}_{m,i} \mid m = 0, 1, \ldots, k - 2,\ i = 0, 1, \ldots, q - 1\}.$$



Lemma 4. The graph of Algorithm 2 has the desired number
of vertices, $|X|$ and $|Y|$, and satisfies the degree requirements.

Proof: This can be shown via careful accounting. We
provide the complete proof in Appendix B-C. ■

Lemma 5. In the constructed bipartite graph of Algorithm 2,
the shortest cycle has length of at least 6 vertices.

Proof: As there are neither odd cycles nor length-2
cycles, we only need to check that there are no length-4
cycles. Since each selection of layer 1 vertices induces a
subgraph which is isomorphic to the $k = l = q + 1$ bipartite
regular cage graph, any properties from the regular graph
also hold for the subgraph. Thus within any $b_h$-subgraph,
there are no 4-cycles.

Consequently, any potential 4-cycle must involve only
edges from layer 2 to layer 3 vertices, where the layer 2
vertices are connected to different $x_j$ vertices of layer 1.
Suppose that the layer 2 vertices $y_{j,\mu}$ and $y_{j',\mu'}$, where
$j \ne j'$, are involved in a 4-cycle with the layer 3 vertices $x^{(h)}_{m,i}$
and $x^{(h')}_{m',i'}$.⁹ Such a cycle implies that the $b_h$-subgraph must
include the edge between $y_{j,\mu}$ and $x^{(h)}_{m,i}$, as well as the edge
between $y_{j',\mu'}$ and $x^{(h)}_{m,i}$; also, the $b_{h'}$-subgraph must include
the edge between $y_{j,\mu}$ and $x^{(h')}_{m',i'}$, as well as the edge between
$y_{j',\mu'}$ and $x^{(h')}_{m',i'}$. This means that the subsets $b_h$ and $b_{h'}$ both
contain the elements $j$ and $j'$. However, since $b_h$ and $b_{h'}$ are
two subsets that do not share any pair of elements, the fact
that $j, j' \in b_h$ and $j, j' \in b_{h'}$ is a contradiction. ■

Lemma 6. Supposing that a bipartite cage (of girth 6) with
parameters $k = q + 1$, $l[n - 1] = p_{n-1}(q)$, $v[n - 1] = p_n(q)$,
and $u[n - 1] = \frac{p_n(q)\,p_{n-1}(q)}{q + 1}$ exists, then Algorithm 2 constructs
a bipartite cage (of girth 6) with parameters $k = q + 1$,
$l[n] = p_n(q)$ (and $v[n] = p_{n+1}(q)$, $u[n] = \frac{p_{n+1}(q)\,p_n(q)}{q + 1}$).

Proof: Follows from Lemmas 4 and 5. ■



Theorem 7. A bipartite cage of girth 6, with parameters
$k = q + 1$ and $l[n] = p_n(q)$, exists and is constructible. This
graph has $v[n] = p_{n+1}(q)$ and $u[n] = \frac{p_{n+1}(q)\,p_n(q)}{q + 1}$.

Proof: The base case where $n = 1$ is the $k = l = q + 1$
cage graph from Algorithm 1, and so is constructible. The
conclusion follows by induction, using Lemma 6. ■
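
The inductive construction can be exercised with a compact sketch for prime $q$. Several choices here are assumptions made for illustration rather than the authors' implementation: the squares come from the formula `(r + a*c) % q`, each block's elements are taken in sorted order as $g_0 < g_1 < \cdots < g_q$, elements are numbered $y_0 = 0$ and $y_{j,m} = jq + m + 1$, and the recursion is started from a single block $\{0, \ldots, q\}$, for which one iteration reproduces the regular $k = l = q + 1$ case.

```python
from itertools import combinations

def squares(q):
    """Assumed squares for prime q: entry (r, c) of square a is (r + a*c) mod q."""
    return [[[(r + a * c) % q for c in range(q)] for r in range(q)]
            for a in range(q)]

def expand(prev_blocks, num_prev_elements, q):
    """One iteration of the layered construction; returns (blocks, |Y|)."""
    sq = squares(q)

    def y(j, m):
        return j * q + m + 1                   # y_{j,m}; y_0 is element 0

    # Layer 1: x_j is adjacent to y_0 and y_{j,0}, ..., y_{j,q-1}.
    blocks = [frozenset([0] + [y(j, m) for m in range(q)])
              for j in range(num_prev_elements)]
    for b in prev_blocks:                      # loop over the blocks b_h
        g = sorted(b)                          # g_0 < g_1 < ... < g_q
        for m in range(q):
            for i in range(q):                 # layer 3 vertex x^(h)_{m,i}
                blocks.append(frozenset(
                    [y(g[0], m)] + [y(g[j + 1], sq[m][i][j]) for j in range(q)]))
    return blocks, 1 + num_prev_elements * q

def cage_design(q, n):
    blocks, v = [frozenset(range(q + 1))], q + 1   # assumed starting design
    for _ in range(n):
        blocks, v = expand(blocks, v, q)
    return blocks, v

def each_pair_once(blocks, v):
    """True iff every pair of the v elements lies in exactly one block."""
    pairs = [pr for b in blocks for pr in combinations(sorted(b), 2)]
    return len(pairs) == len(set(pairs)) == v * (v - 1) // 2

blocks, v = cage_design(2, 2)
print(len(blocks), v, each_pair_once(blocks, v))  # 35 15 True
```

With $q = 2$ and two iterations this gives 35 blocks of size 3 over 15 elements, in agreement with the parameters of Theorem 7.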

In Figure 6, we show the resulting storage system design
after iteration $n = 2$, for the case $q = 2$ (i.e., $k = 3$). This
system is in fact an extension of the $k = 3$, $l = 3$ block design
of Figure 4: for storage nodes $b_0, b_1, \ldots, b_6$, the first 3 data
chunks in each node are exactly the same between Figures
4b and 6. This scalability will be explained in Section IV-B.

These cage graphs form a family of designs where $k = q + 1$,
$l = p_n(q)$, $v = p_{n+1}(q)$, and $u = \frac{p_{n+1}(q)\,p_n(q)}{q + 1}$, and thus
are coincident with the Steiner systems $S(2, q + 1, p_{n+1}(q))$.

⁹We know $h \ne h'$, or else the 4-cycle is entirely within the $b_h$-subgraph.

b0:    0   1   2   7   8   9  10
b1:    0   3   6  11  14  15  18
b2:    0   4   5  12  13  16  17
b3:    1   3   5  19  22  23  26
b4:    1   4   6  20  21  24  25
b5:    2   3   4  27  30  31  34
b6:    2   5   6  28  29  32  33
b7:    7  11  13  19  21  27  29
b8:    7  12  14  20  22  28  30
b9:    8  11  12  23  25  31  33
b10:   8  13  14  24  26  32  34
b11:   9  15  17  19  20  31  32
b12:   9  16  18  21  22  33  34
b13:  10  15  16  23  24  27  28
b14:  10  17  18  25  26  29  30

Fig. 6. Block design for distributed storage system corresponding to $k = 3$
and $l = 7$. Each data chunk has 3 replicas and each storage node stores
7 chunks. In total there are 35 distinct data chunks and 15 storage nodes.

B. Advantages of Scaled Constructions 

The construction of Algorithm 2 is not merely a construction
for a cage graph with large degree $l[n]$ for the vertices
of $Y$. This particular construction also allows for the easy
expansion of storage systems built using these methods. That
is, if an extant system has $l[n - 1] = p_{n-1}(q)$, it is relatively
simple to increase the size of the system so that the vertices
of $Y$ have degree $l[n] = p_n(q)$. This is because the following holds:

Theorem 8. Consider a cage graph, $G[n]$, with parameters
$k = q + 1$, $l[n] = p_n(q)$ constructed in iteration $n$ of
Algorithm 2. The cage graph with parameters $k = q + 1$,
$l[n - 1] = p_{n-1}(q)$ (i.e., constructed in the previous iteration
of Algorithm 2 and called $G[n - 1]$) is a subgraph of $G[n]$.

Theorem 8 will be proved with the help of Lemma 9.

Lemma 9. Consider a cage graph with $k = q + 1$,
$l[n] = p_n(q)$, to be constructed in the $n$-th iteration of
Algorithm 2. From the set of $p_n(q)$ elements and the collection
of blocks $B[n]$, it is possible to select a subset of
$p_{n-1}(q)$ elements, called $S[n - 1]$, such that the subcollection
of blocks from $B[n]$ that contain only elements from $S[n - 1]$
is [isomorphic to] the entire collection of blocks $B[n - 1]$
required in the $(n - 1)$-th iteration of the algorithm.

Proof: Here, the elements are $Y$ and the blocks are $X$.
We now prove by induction.

The base case is $n = 2$. The cage graph with parameters
$k = q + 1$, $l[1] = q + 1$ is the graph from Algorithm 1. In order
to construct the cage graph with parameters $k = q + 1$, $l[2] =
q^2 + q + 1$ during iteration 2, we choose elements from the
collection of blocks $B[2] = X[1]$ (i.e., the block collection $B$
at iteration 2 corresponds to the vertex set $X$ at iteration 1).
By construction, the block $b_0 \in B[2]$ contains $q + 1$ elements,
so the subgraph associated with $b_0$ is isomorphic to the cage
graph with parameters $k = q + 1$, $l[1] = q + 1$.

Now consider an arbitrary iteration $n$. From the $(n - 1)$-th
iteration, we know that $B[n - 2] \subset B[n - 1]$ (up to
isomorphism with appropriate indexing of elements). Since
$B[n - 1]$ is used in iteration $n$ of Algorithm 2 for choosing
subsets of layer 1 vertices to construct $G[n]$, and $B[n - 2]$ was
used in iteration $n - 1$ for choosing subsets of layer 1 vertices
to construct $G[n - 1]$, then the fact that $B[n - 2] \subset B[n - 1]$
results in $G[n - 1]$ being a subgraph of $G[n]$. The block
systems corresponding to $G[n - 1]$ and $G[n]$ thus satisfy
$B[n - 1] \subset B[n]$ (again, up to isomorphism with appropriate
indexing). Because the blocks of $B[n - 1]$ contain a total of
$p_{n-1}(q)$ elements (i.e., $|\{y \in b \mid b \in B[n - 1]\}| = p_{n-1}(q)$),
the result is shown. ■

Now we can prove Theorem 8.

Proof of Theorem 8: From Lemma 9, one can select
$p_{n-1}(q)$ layer 1 vertices such that the block system consisting
of only these vertices is isomorphic to $B[n-1]$. The subgraph
constructed through these layer 1 vertices is thus isomorphic
to the graph of the previous iteration, $G[n-1]$. ■

For the distributed storage system, we take $Y$ as the blocks
and $X$ as the elements. Thus each element has $k = q + 1$
repetitions and each block has size $l = p_n(q)$ (such a system
requires a total of $v = p_{n+1}(q)$ storage nodes and stores
a total of $u = p_{n+1}(q)\,p_n(q)/(q+1)$ distinct data chunks). From
Theorem 8, we see that because $G[n-1]$ is a subgraph
of $G[n]$, where the subgraph is a truncation of outgoing
edges from each $Y$ vertex, the blocks of size
$l[n-1] = p_{n-1}(q)$ are truncations of the blocks of size
$l[n] = p_n(q)$. Equivalently, if we have constructed (using
Algorithm 2) the storage system with block size
$l[n-1] = p_{n-1}(q)$, then expanding to a storage system with block
size $l[n] = p_n(q)$ can be accomplished by appending the
remaining outgoing edges from each $Y$ vertex. No elements
need to be moved from the existing system, and yet the
Steiner property (of no repeating pairs of elements) will still
hold; one need only append new elements to the appropriate
blocks. For instance, the expansion of the system of Figure 4b
results in the appended storage system of Figure 6.
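The append-only expansion can be illustrated with a small sketch. It does not implement Algorithm 2; as a stand-in it assumes the textbook projective geometry $PG(d, 2)$, with points encoded as the nonzero integers below $2^{d+1}$ and lines as triples $\{a, b, a \oplus b\}$. Points play the role of storage nodes and lines the role of data chunks, so $q = 2$, $k = 3$, and the step from $d = 2$ to $d = 3$ corresponds to growing from $l = 3$ to $l = 7$.

```python
from itertools import combinations


def pg2_lines(dim: int) -> set:
    """Lines of PG(dim, 2): each line is a triple {a, b, a XOR b} of points."""
    points = range(1, 2 ** (dim + 1))
    return {frozenset((a, b, a ^ b)) for a, b in combinations(points, 2)}


def node_contents(lines: set) -> dict:
    """Map each storage node (point) to the set of chunks (lines) it stores."""
    nodes = {}
    for line in lines:
        for point in line:
            nodes.setdefault(point, set()).add(line)
    return nodes


small = node_contents(pg2_lines(2))  # 7 nodes, 3 chunks per node
large = node_contents(pg2_lines(3))  # 15 nodes, 7 chunks per node

# Every existing node keeps its old chunks and only gains new ones.
assert all(small[node] <= large[node] for node in small)
# No pair of chunks is stored together on more than one node.
assert all(len(a & b) <= 1 for a, b in combinations(pg2_lines(3), 2))

print(len(small), len(large), len(pg2_lines(3)))  # 7 15 35
```

Under these assumptions nothing stored on the original 7 nodes has to move; the 8 new nodes and 28 new chunks are simply added.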

It is similarly simple to construct a storage system which
has total number of elements, $u$, that is between the valid
quantities $u[n-1]$ and $u[n]$ (i.e., $u[n-1] < u < u[n]$).
One should construct the system for $u[n]$ elements (i.e.,
$k = q + 1$ and $l[n] = p_n(q)$) and then leave empty
slots in the blocks which are supposed to store elements
$x_u, x_{u+1}, x_{u+2}, \ldots, x_{u[n]-1}$. This will preserve the Steiner
property and also allow expansion of the storage system until
$u[n]$ elements arrive.
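A minimal sketch of this empty-slot idea follows. It assumes a hard-coded stand-in design (the Fano plane, i.e., $q = 2$ with 7 blocks and 7 elements) rather than the output of Algorithm 2, and the helper names are hypothetical.

```python
from itertools import combinations

# Stand-in design: 7 blocks (storage nodes), each holding 3 of the
# 7 elements (chunks) x_0..x_6; every element appears in 3 blocks.
FULL_BLOCKS = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5},
               {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]


def partial_system(blocks, u):
    """Keep only elements x_0..x_{u-1}; removed positions act as empty slots."""
    return [{x for x in block if x < u} for block in blocks]


def no_repeated_pairs(blocks):
    """True if no pair of elements occurs together in more than one block."""
    seen = set()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            if pair in seen:
                return False
            seen.add(pair)
    return True


# The property holds for every intermediate number of stored elements.
assert all(no_repeated_pairs(partial_system(FULL_BLOCKS, u)) for u in range(8))
```

Because deleting elements can only remove pairs, any partially filled system inherits the no-repeated-pairs property from the full design.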

C. Other Scalable Constructions 

Due to space constraints, we do not discuss the construction
of a related class of block designs, which are those that
coincide with affine geometries [3]. A similar construction
to Algorithm 1 can be used to construct cage graphs where
$k = q$ and $l = q + 1$, leading to the graph of Figure 2 when
$q = 3$. From this base case, similar scalability results can be
derived for storage system designs with $k = q$ and $l = p_n(q)$.

V. Conclusion 

In this paper, we give practical, scalable, and implementable
constructions of bipartite cage graphs where the vertex degrees
are highly asymmetric. This allows for the design of distributed
storage systems based on Steiner systems, where the number of
replicas of each data chunk may be much smaller than the storage
node size. Using our constructions, a system designer can
guarantee that a system consuming the least amount of resources
(e.g., fewest number of storage nodes) has been deployed, and
also be able to easily expand the storage system when necessary.

We further comment that the chunk distribution schemes
given by our cage graph construction method can also be used
to guarantee collision resistance in existing storage system
implementations. As an example, for storage systems implementing
distributed hash tables (DHTs), such as CAN [22], Chord [23],
Pastry [24], and Tapestry [25], when the desired replication
degree and number of storage nodes are known, the chunk and
replica locations from the appropriate block design may be used
as the hashing function.
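As a rough illustration of using a block design as a placement function, the sketch below builds a lookup table from chunk index to replica nodes. It is not tied to any particular DHT; it assumes the lines of $PG(3, 2)$ (points as nonzero 4-bit integers, lines as $\{a, b, a \oplus b\}$) as a stand-in design, and `replica_nodes` is a hypothetical name.

```python
from itertools import combinations


def build_placement(dim: int = 3) -> list:
    """Chunk index -> sorted tuple of node ids, one entry per line of PG(dim, 2)."""
    points = range(1, 2 ** (dim + 1))
    lines = {tuple(sorted((a, b, a ^ b))) for a, b in combinations(points, 2)}
    return sorted(lines)


PLACEMENT = build_placement()


def replica_nodes(chunk_id: int) -> tuple:
    """Deterministic placement: the nodes holding the replicas of a chunk."""
    return PLACEMENT[chunk_id]


print(len(PLACEMENT), replica_nodes(0))  # 35 (1, 2, 3)
```

With this table, any two chunks share at most one node, which is the sense in which the placement avoids collisions in this sketch.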
Appendix A
Latin Squares 

We discuss Latin squares and mutually-orthogonal Latin
squares, which will aid in the construction of bipartite cage
graphs of girth 6. A comprehensive treatment of Latin squares
can be found in the text by Dénes and Keedwell [20].

Definition 1. Consider a $q \times q$ matrix $L$ where the entries
take on values from $Q = \{0, 1, 2, \ldots, q-1\}$. Then $L$ is a
Latin square if for every row $i$, the entries satisfy $L_{i,j} \neq L_{i,j'}$
whenever $j \neq j'$; and for every column $j$, the entries satisfy
$L_{i,j} \neq L_{i',j}$ whenever $i \neq i'$.

Definition 2. A column $j$ of a square $L$ is considered to
be in natural order if the symbols $\{0, 1, \ldots, q-1\}$ occur in
sequential order, i.e., $L_{i,j} = i$ for $i = 0, 1, \ldots, q-1$.

In fact, given any Latin square, by labeling symbols and
permuting columns appropriately, we can establish a Latin
square with a specified column in natural order. Next we
define the concept of orthogonality for Latin squares.

Definition 3. A pair of $q \times q$ squares, $L^{(m)}$, $L^{(m')}$, is
considered orthogonal if the set of ordered pairs of elements satisfies
$\{(L^{(m)}_{i,j}, L^{(m')}_{i,j}) \mid i, j \in Q\} = \{(a, b) \mid a, b \in Q\}$. Thus
$L^{(m)}$ and $L^{(m')}$ are orthogonal if the pairwise catenation of the
two squares takes on all $q^2$ pairs of symbols chosen from $Q$.

Definition 4. A set of $r$ squares $\{L^{(1)}, L^{(2)}, \ldots, L^{(r)}\}$ (each
of size $q \times q$) is considered to be mutually-orthogonal if every
pair of squares, $L^{(m)}$, $L^{(m')}$ where $m, m' = 1, 2, \ldots, r$ and
$m \neq m'$, are orthogonal.
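Definitions 1, 3, and 4 translate directly into small checks. The following sketch represents a square as a list of rows with entries in $\{0, \ldots, q-1\}$; the function names and the two $3 \times 3$ example squares are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def is_latin(square) -> bool:
    """Definition 1: every row and every column contains each symbol once."""
    q = len(square)
    symbols = set(range(q))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(q)} == symbols for j in range(q))
    return rows_ok and cols_ok


def are_orthogonal(a, b) -> bool:
    """Definition 3: the q*q superimposed ordered pairs are all distinct."""
    q = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(q) for j in range(q)}
    return len(pairs) == q * q


def mutually_orthogonal(squares) -> bool:
    """Definition 4: every pair of distinct squares is orthogonal."""
    return all(are_orthogonal(a, b) for a, b in combinations(squares, 2))


A = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
B = [[0, 2, 1], [1, 0, 2], [2, 1, 0]]
print(is_latin(A), is_latin(B), are_orthogonal(A, B))  # True True True
```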

When $q$ is a prime or prime power, sets of mutually-orthogonal
Latin squares can be derived by first identifying the generator
of the multiplicative group associated with the finite field of
order $q$. That is, consider a Galois field $GF(q)$ with primitive
element $\alpha$, so that the elements are

$$e_0 = 0,\ e_1 = 1,\ e_2 = \alpha,\ e_3 = \alpha^2,\ \ldots,\ e_{q-1} = \alpha^{q-2}.$$

Then the Latin squares, $L^{(1)}, L^{(2)}, \ldots, L^{(q-1)}$, with entries

$$L^{(m)}_{i,j} = e_i + e_m e_j, \quad \forall m = 1, 2, \ldots, q-1, \quad i, j = 0, 1, \ldots, q-1$$

are mutually-orthogonal with natural order zeroth column. (If
$e_i \neq i$, then we can always reorder the rows of $L^{(m)}$ so that the
zeroth column consists of the symbols $\{0, 1, \ldots, q-1\}$ in sequential order.)
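For prime $q$ the construction can be sketched with ordinary modular arithmetic. This is a simplification, not the general method: it assumes $q$ is prime so that the integers mod $q$ form $GF(q)$, and it labels the field elements as $e_i = i$ rather than by powers of $\alpha$, which makes the zeroth column come out in natural order without any row reordering. Prime powers would need genuine finite-field arithmetic.

```python
from itertools import combinations


def mols_prime(q: int) -> list:
    """Squares L^(1)..L^(q-1) with L^(m)[i][j] = (i + m*j) mod q, for prime q."""
    return [[[(i + m * j) % q for j in range(q)] for i in range(q)]
            for m in range(1, q)]


def are_orthogonal(a, b) -> bool:
    """True if superimposing a and b yields all q*q ordered pairs."""
    q = len(a)
    return len({(a[i][j], b[i][j]) for i in range(q) for j in range(q)}) == q * q


squares = mols_prime(5)
# The zeroth column of every square is in natural order.
assert all(sq[i][0] == i for sq in squares for i in range(5))
print(len(squares),
      all(are_orthogonal(a, b) for a, b in combinations(squares, 2)))  # 4 True
```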



If we let the $q \times q$ matrix $L^{(0)}$ consist of $L^{(0)}_{i,j} = i$ for
all $i, j \in \{0, 1, \ldots, q-1\}$ (i.e., each column of $L^{(0)}$ is the
same, and consists of symbols numbered sequentially), then
the set of squares $\{L^{(0)}, L^{(1)}, L^{(2)}, \ldots, L^{(q-1)}\}$ is a set of $q$
mutually-orthogonal squares, where only $L^{(0)}$ is not Latin.

Lemma 10. For $\{L^{(0)}, \ldots, L^{(q-1)}\}$ as defined above and any
$m \neq m'$, we have $L^{(m)}_{i,j} = L^{(m')}_{i,j}$ if and only if $j = 0$.
That is, for any pair of squares, only the zeroth column has
overlapping entries.

Proof: Because all of the squares have the zeroth column
in natural order, sufficiency is immediate. Now, since there
are $q$ entries in the zeroth column, and there are only $q$ pairs
of elements $(a, b)$ such that $a = b$ (where $a, b \in Q$), by the
definition of mutually-orthogonal squares, we know that no
other [non-zeroth] column will have overlapping entries. ■
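Lemma 10 can be checked numerically on a small case. The sketch below assumes prime $q$ and the simplified labeling $e_i = i$, so that $L^{(m)}_{i,j} = (i + m j) \bmod q$ for $m = 0, 1, \ldots, q-1$ (the case $m = 0$ gives $L^{(0)}_{i,j} = i$); the helper names are illustrative.

```python
def squares_with_l0(q: int) -> list:
    """L^(0)..L^(q-1) for prime q, with L^(m)[i][j] = (i + m*j) mod q."""
    return [[[(i + m * j) % q for j in range(q)] for i in range(q)]
            for m in range(q)]


def overlapping_columns(a, b) -> set:
    """Column indices j where a and b agree in at least one row."""
    q = len(a)
    return {j for i in range(q) for j in range(q) if a[i][j] == b[i][j]}


q = 5
squares = squares_with_l0(q)
for m in range(q):
    for m2 in range(m + 1, q):
        # Only the zeroth column has overlapping entries.
        assert overlapping_columns(squares[m], squares[m2]) == {0}
```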

Appendix B 
Proofs of Selected Lemmas 

A. Proof of Lemma 1

The lower bound on $v$ can be seen by considering an
arbitrary vertex $\hat{y} \in Y$. The vertex $\hat{y}$ must be connected
to $l$ distinct vertices of $X$; call this subset of vertices
$\hat{X} \subset X$. Now suppose that two vertices $x_1, x_2 \in \hat{X}$ were
also both connected to some other vertex $y \neq \hat{y}$. Then the
graph would have a cycle of length 4, consisting of vertices
$\hat{y} \to x_1 \to y \to x_2 \to \hat{y}$. Thus each vertex in $\hat{X}$ must be
connected to $k - 1$ unique vertices of $Y$; we let these vertices
be $\hat{Y}$, where $|\hat{Y}| = l(k-1)$. Because $|\{\hat{y}\} \cup \hat{Y}| = 1 + l(k-1)$
(since $\hat{y} \notin \hat{Y}$), we establish the lower bound on $v = |Y|$.

Now consider the $l(k-1)$ vertices of $\hat{Y}$. These vertices
must each be connected to only one vertex of $\hat{X}$. Otherwise,
a vertex $y \in \hat{Y}$ connected to both $x_1 \in \hat{X}$ and $x_2 \in \hat{X}$ would
form the 4-cycle $\hat{y} \to x_1 \to y \to x_2 \to \hat{y}$ (similar to above).
Therefore for any $y \in \hat{Y}$, the vertex must connect to at least
$l - 1$ vertices of $X \setminus \hat{X}$. Let $\tilde{X}$ consist of vertices in $X \setminus \hat{X}$ such
that all $x \in \tilde{X}$ are connected to some vertex in $\hat{Y}$. Since there
are at least $l(k-1)(l-1)$ edges between $\hat{Y}$ and $\tilde{X}$, and any
vertex $x \in \tilde{X}$ has degree $k$, then $|\tilde{X}| \geq l(k-1)(l-1)/k$. As
$\hat{X} \cap \tilde{X} = \emptyset$, so $u = |X| \geq |\hat{X}| + |\tilde{X}| \geq l + l(l-1)(k-1)/k$.
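The two bounds from this proof are easy to evaluate. The sketch below is illustrative (the function name is not from the paper); it rounds the bound on $u$ up to the next integer, since $u$ counts vertices.

```python
from fractions import Fraction
from math import ceil


def lower_bounds(k: int, l: int) -> tuple:
    """Return (v_min, u_min) with v >= 1 + l(k-1) and u >= l + l(l-1)(k-1)/k."""
    v_min = 1 + l * (k - 1)
    u_min = ceil(l + Fraction(l * (l - 1) * (k - 1), k))
    return v_min, u_min


print(lower_bounds(3, 7))  # (15, 35)
```

For $k = 3$ and $l = 7$ the bounds equal the 15 storage nodes and 35 data chunks of the example system, so that design meets both bounds with equality.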

B. Proof of Lemma 2

We show that there are no cycles of lengths 2 or 4 (since
bipartite graphs have no odd cycles). Clearly, there are no
2-cycles, since the graph is simple (i.e., no multiple edges).
To show that there are no cycles of length 4, we consider
vertices from each particular layer, and show that the
construction results in no 4-cycle involving the vertices at
that layer. For layer 0, there are no 4-cycles which include
vertex $y_0$, as layers 0, 1, and 2 form a tree of depth 3. Now
consider any 4-cycles which include some vertex $x_j$ from
layer 1. Such a 4-cycle must also include $y_{j,m}$ and $y_{j,m'}$ for
some $m \neq m'$ (and $m, m' \in \{0, 1, \ldots, k-2\}$). If $j = 0$, then
the corresponding step of the algorithm guarantees that $y_{j,m}$
and $y_{j,m'}$ do not connect to any layer 3 vertices in common.
For $j \neq 0$, any layer 3 vertex is connected to at most one
vertex of $\{y_{j,\mu} \mid \mu = 0, 1, \ldots, k-2\}$, so the vertices $y_{j,m}$ and
$y_{j,m'}$ can not be connected to the same layer 3 vertex for $m \neq m'$.



This leaves 4-cycles consisting only of layer 2 and layer 3
vertices. Suppose that a vertex $y_{0,m}$ is a member of a
4-cycle (for any $m \in \{0, 1, \ldots, k-2\}$). Note that $y_{0,m}$ is
only connected to the $l - 1$ vertices $x_{m,i}$, $i = 0, 1, \ldots, l-2$.
Because $L^{(m)}$ has Latin columns (even for $L^{(0)}$), we see
that the layer 3 vertices $x_{m,i}$ and $x_{m,i'}$, where $i \neq i'$, will
never connect to the same layer 2 vertex, i.e.,
$y_{j+1, L^{(m)}_{i,j}} \neq y_{j+1, L^{(m)}_{i',j}}$ for any $j = 0, 1, \ldots, l-2$.
(Of course, $x_{m,i}$ and $x_{m,i'}$ are both connected to $y_{0,m}$, but
they are connected to no other common vertex.) Thus, $y_{0,m}$ is
not part of a 4-cycle.

Now consider a potential 4-cycle consisting of vertices $y_{j,\mu}$
and $y_{j',\mu'}$, where $j, j' \neq 0$ and $j \neq j'$. Then there will be
two layer 3 vertices $x_{m,i}$ and $x_{m',i'}$ such that
$L^{(m)}_{i,j-1} = \mu = L^{(m')}_{i',j-1}$
and $L^{(m)}_{i,j'-1} = \mu' = L^{(m')}_{i',j'-1}$. However, this would
imply that the two squares $L^{(m)}$ and $L^{(m')}$ have two separate
columns, $j - 1$ and $j' - 1$, where overlapping entries between
the two squares can be found; this contradicts Lemma 10
since only the zeroth column has overlapping entries. Thus
no 4-cycles exist which involve layer 2 vertices.

Since layer 3 vertices must connect to layer 2 vertices, this 
implies that the shortest cycle consists of at least 6 vertices. 
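The absence of 4-cycles can also be checked by brute force on small cases. The sketch below is a reconstruction, not the paper's Algorithm 1: the edge rule (layer 3 vertex $x_{m,i}$ joined to $y_{0,m}$ and to $y_{j+1, L^{(m)}_{i,j}}$ for each column $j$) is inferred from the wording of this proof, $q$ is assumed prime with $L^{(m)}_{i,j} = (i + m j) \bmod q$, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def layered_graph(q: int) -> dict:
    """Adjacency map from each X vertex to its Y neighbours, for k = l = q + 1."""
    k = l = q + 1
    adj = {}
    # Layer 1: x_j joins the root y_0 and its own layer 2 vertices y_{j,m}.
    for j in range(l):
        adj[("x1", j)] = {("y0",)} | {("y2", j, m) for m in range(k - 1)}
    # Layer 3: x_{m,i} joins y_{0,m} and one y_{j+1,*} per column j of L^(m).
    for m in range(k - 1):
        for i in range(l - 1):
            nbrs = {("y2", 0, m)}
            for j in range(l - 1):
                nbrs.add(("y2", j + 1, (i + m * j) % q))
            adj[("x3", m, i)] = nbrs
    return adj


def has_4_cycle(adj: dict) -> bool:
    """A bipartite graph has a 4-cycle iff two X vertices share two Y neighbours."""
    return any(len(adj[a] & adj[b]) >= 2 for a, b in combinations(adj, 2))


def y_degrees(adj: dict) -> Counter:
    """Degree of every Y vertex."""
    return Counter(y for nbrs in adj.values() for y in nbrs)


g = layered_graph(3)
print(len(g), len(y_degrees(g)), has_4_cycle(g))  # 13 13 False
```

Since the graph is bipartite and simple, ruling out 4-cycles is enough to conclude that every cycle has length at least 6.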

C. Proof of Lemma 4

We want to show that all the vertices in $Y$ have exactly
$l[n] = p_n(q)$ outgoing edges, and all the vertices in $X$ have
exactly $k = q + 1$ outgoing edges. Furthermore, $|Y| = v[n] = p_{n+1}(q)$
and $|X| = u[n] = p_{n+1}(q)\,p_n(q)/(q+1)$.

First we verify that we have the correct number of vertices.
For $Y$, there is 1 layer 0 vertex. In layer 2, we will have
$l(k-1) = p_n(q)\,q = p_{n+1}(q) - 1$ vertices, since each of
the $l$ vertices $x_j$, $j = 0, 1, \ldots, l-1$, is connected to $k - 1$
different layer 2 vertices of $Y$. Thus $v[n] = |Y| = p_{n+1}(q)$.
For $X$, there are $l[n] = p_n(q)$ layer 1 vertices. For layer 3, in
each of the $u[n-1]$ iterations of step 6, there are $(k-1)q = q^2$
distinct vertices of $X$ involved. Thus, layer 3 consists
of $q^2 u[n-1] = q^2\,p_n(q)\,p_{n-1}(q)/(q+1)$ vertices. Therefore,

$$u[n] = |X| = p_n(q) + \frac{q^2\,p_n(q)\,p_{n-1}(q)}{q+1} = \frac{p_{n+1}(q)\,p_n(q)}{q+1}.$$

Now we count the number of edges from each vertex. For
layer 0, step 2 results in degree of $l[n] = p_n(q)$ for vertex $y_0$.
For layer 1, each vertex $x_j$, $j = 0, 1, \ldots, l-1$, is connected
to exactly $q + 1$ vertices (one edge to $y_0$ and then $q$ edges to
the layer 2 vertices), as can be seen from step 3.

Now consider a particular layer 2 vertex $y_{j,m}$. We know in
the collection of subsets, $B$, that each element $j$ is selected
exactly $l[n-1] = p_{n-1}(q)$ times; thus, $y_{j,m}$ occurs in exactly
$l[n-1] = p_{n-1}(q)$ iterations. Moreover, in each iteration that
$y_{j,m}$ occurs, it has exactly $q$ edges to the layer 3 vertices
(whether or not $j$ is the $g_0$ or some $g_{j'+1}$ of the current
subset $b_h$). Therefore, each layer 2 vertex has 1 edge to
its corresponding layer 1 vertex, and $q\,p_{n-1}(q)$ edges to the
layer 3 vertices, for a total of $1 + q\,p_{n-1}(q) = p_n(q)$ edges,
which is the desired degree for that vertex.

By construction, every layer 3 vertex $x^{(h)}_{m,i}$ has degree $q + 1$,
that is, 1 edge from the associated $y_{g_0,m}$ and $q$ edges to the
vertices $y_{g_{j+1}, L^{(m)}_{i,j}}$, $j = 0, 1, \ldots, q-1$, connected via the
Latin squares method.
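The counting identities used in this proof can be spot-checked numerically. The sketch below assumes $p_n(q) = 1 + q + \cdots + q^n$ (inferred from $l[1] = q + 1$ and $l[2] = q^2 + q + 1$) and $u[n] = p_{n+1}(q)\,p_n(q)/(q+1)$; the function names are illustrative.

```python
def p(n: int, q: int) -> int:
    """Assumed definition: p_n(q) = 1 + q + ... + q^n."""
    return sum(q**i for i in range(n + 1))


def u(n: int, q: int) -> int:
    """Number of X vertices at iteration n: p_{n+1}(q) p_n(q) / (q + 1)."""
    return p(n + 1, q) * p(n, q) // (q + 1)


for q in (2, 3, 4, 5, 7):
    for n in range(2, 7):
        # Layer 2 size: l(k - 1) = p_n(q) q = p_{n+1}(q) - 1.
        assert p(n, q) * q == p(n + 1, q) - 1
        # Layers 1 and 3 together give u[n] vertices of X.
        assert p(n, q) + q * q * u(n - 1, q) == u(n, q)
        # Degree of a layer 2 vertex: 1 + q p_{n-1}(q) = p_n(q).
        assert 1 + q * p(n - 1, q) == p(n, q)
        # Edges counted from Y (degree l) and from X (degree k) agree.
        assert p(n + 1, q) * p(n, q) == u(n, q) * (q + 1)
```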